[57] ABSTRACT

The multimodal telephone prompts the user using both a 
visual display and synthesized voice. It receives user input 
via keypad and programmable soft keys associated with the 
display, and also through user-spoken commands. The voice 
module includes a two stage speech recognizer that models 
speech in terms of high similarity values. A dialog manager 
associated with the voice module maintains the visual and 
verbal systems in synchronism with one another. The dialog 
manager administers a state machine that records the dialog 
context. The dialog context is used to ensure that the 
appropriate visual prompts are displayed — showing what 
commands are possible at any given point in the dialog. The 
speech recognizer also uses the dialog context to select the 
recognized word candidate that is appropriate to the current 
context. 

13 Claims, 14 Drawing Sheets 
MULTIMODAL VOICE DIALING DIGITAL
KEY TELEPHONE WITH DIALOG 
MANAGER 

BACKGROUND AND SUMMARY OF THE 
INVENTION 

The present invention relates generally to digital telephones
and telephone systems, such as private branch exchange (PBX)
systems. More particularly, the invention relates to a multimodal
telephone that provides both voice and touchpad control through
an integrated system employing speech recognition and speech
generation together with an optical display such as an LCD panel.
The user communicates with the telephone to perform voice dialing
and other system control functions by interacting with the
integrated dialog manager, which ensures that the voice mode and
visual/touchpad mode remain synchronized.

The telephone has evolved quite considerably since Alex- 
ander Graham Bell. Today, complex telephone stations con- 
nect to sophisticated switching systems to perform a wide 
range of different telecommunication functions. Indeed, the 
modern-day telephone device has become so sophisticated 
that the casual user needs an instruction manual to be able 
to operate it. The typical modern-day telephone device
features a panoply of different function buttons, including a 
button to place a conference call, a button to place a party 
on hold, a button to flash the receiver, a button to select 
different outside lines or extensions and buttons that can be 
programmed to automatically dial different frequently called 
numbers. Clearly, there is a practical limit to the number of 
buttons that may be included on the telephone device, and 
that limit is rapidly being approached. 

It has been suggested that voice operated telephones may 
provide the answer. With a sufficiently robust speech 
recognizer, the telephone could, in theory, be controlled 
entirely by voice. It is doubtful that such a device could be 
successfully achieved using today's technology; simply 
incorporating speech recognition into the telephone would 
not result in a device that is easy to use. 

Anyone who has been caught in the endless loop of a 
voice mail system will understand why voice control of the 
telephone is a significant challenge. It is difficult to offer the 
telephone user a wide assortment of control functions and 
operations when those options are prompted by speech 
synthesis and must be responded to by voice. The user 
typically has difficulty remembering all of the different 
choices that are possible and difficulty remembering what 
the precise commands are to invoke those operations. Also, 
speech recognizers will occasionally misinterpret a user's 
command, resulting in the need to abort the command or 
enter it again. If the user's speech differs significantly from 
the model on which the recognizer has been trained, the 
recognizer may also fail to recognize the abort command. 
When this happens the system may execute an unwanted 
command, causing user frustration and inconvenience. 

The problem is compounded when voice dialing is
desired, because voice dialing significantly increases the 
size of the dictionary of words that must be recognized. 
Essentially, every new name that is added to the phone 
directory becomes another word that must be properly 
interpreted by the recognizer. 

The present invention solves the problem with a new 
approach that integrates voice prompts, visual prompts, 
spoken commands and push button commands so that the 
user always has a choice. The telephone includes a dialog 
manager that monitors the user's spoken commands and 
push button commands, maintaining both modes in synchronism
at all times. The result is a natural, easy-to-use system
that does not require an extensive user's manual. The dialog
manager displays the commands that are possible, which the
user can select by pressing the soft key buttons on the
keypad adjacent the visual display or by speaking the
commands into the handset. The soft key buttons are push 
buttons whose function changes according to the state of the 
dialog. The current function of the soft key button is 
indicated on the visual display adjacent the button. As the 
user is first learning the system the visual display provides 
convenient prompts so that the user will always know what 
commands are possible at any given time. As the user begins 
to learn these commands he or she may choose to simply 
enter them by speaking into the handset, without even 
looking at the visual display. Of course, even the experi- 
enced user may occasionally choose to use the soft key push 
buttons — when the user cannot use the spoken commands or 
when entering an abort command to cancel an earlier 
command that was misinterpreted by the recognizer. 

The preferred embodiment of the telephone system is 
implemented in a modular way, with the voice recognition 
and synthesis functions as well as the dialog manager being 
disposed on a circuit card that plugs into a separate card 
supporting the touchpad, soft keys and visual display functions.
The preferred architecture allows the telephone to be
manufactured either with or without voice capability or the 
sophisticated dialog manager. Later, these features can be 
added to the telephone by simply plugging in the voice card. 

By way of summary, the multimodal telephone of the
invention comprises a telephone unit having a microphone
and a speaker for supporting voiced communication by a
user. The microphone and speaker may be incorporated into
the handset of the telephone unit according to conventional
practice, or they may be separate from the handset. A visual
display device is disposed on the telephone unit, the display
being adapted for displaying a plurality of different command
prompts to the user. The presently preferred embodiment
employs a multiline liquid crystal display (LCD) for
this purpose. The multimodal telephone further comprises at
least one programmable function key for entry of keyed
commands by the user. The function key is disposed on the
telephone unit adjacent the visual display, so that at least a
portion of the command prompts are displayed approximately
adjacent the function key. The preferred embodiment
uses several such function keys, with adjacent command
prompts defining the current function of the key.

A speech module is disposed in the telephone unit. The
speech module includes a voice recognizer and a speech
generator or synthesizer. The speech module is coupled to
the telephone unit so that the voice recognizer is responsive
to voiced commands entered through the microphone, and
the speech synthesizer provides audible prompts through the
speaker.

The multimodal telephone further comprises a dialog
manager coupled to the visual display as well as to the
function keys and the speech module. The dialog manager
defines a hierarchically arranged set of control function
states. Each state is associated with one of the command
prompts and at least a portion of the states are further
associated with one of the audible prompts. The dialog
manager is responsive to the voiced commands, and also to
the function keys, to traverse the hierarchically arranged set
of control function states and select one of the control
function states as the active state.

The dialog manager is operative to maintain synchronism 
between the command prompts and the audible prompts. 
The dialog manager is also operative to maintain synchronism
between voiced commands and keyed commands, so that the state
hierarchically adjacent to the active state is displayed as a
command prompt and the user has the option to move from the
active state to the hierarchically adjacent state by either
voiced command or keyed command.

For a more complete understanding of the invention, its
objects and advantages, reference may be had to the following
specification and drawings and to the pseudocode listing
in the Appendix.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an elevation view of a multimodal voice dialing
digital telephone according to a preferred embodiment;

FIGS. 2a and 2b (collectively referred to as FIG. 2) are
views of alternative displays that may be used in the
telephone of FIG. 1;

FIG. 3 is a block diagram of the components comprising
the telephone shown in FIG. 1;

FIG. 4 is a diagram showing the data stored in the
telephone database shown in FIG. 3;

FIG. 5 is a schematic pin-out diagram of the processor and
the speech card of the telephone of FIG. 1;

FIG. 6 is a data flow diagram showing the major functional
components of the multimodal telephone system and
how data flows among those systems;

FIG. 7 is an overview of a state machine diagram depicting
how the respective state machines of the phone processor
and the dialog manager are integrated;

FIGS. 8 and 9 collectively represent the state machine of
the dialog manager, showing what control function states are
possible in the preferred embodiment and how those states
are hierarchically arranged;

FIG. 10 is a phoneme similarity time series for the word
"hill" spoken by two speakers;

FIG. 11 is a series of graphs showing the output of the
region picking procedure whereby similarity values are
converted into high similarity regions;

FIG. 12 is a block diagram of the presently preferred word
recognizer system;

FIG. 13 is a block diagram illustrating the target congruence
word prototype training procedure.

DETAILED DESCRIPTION

A multimodal voice dialing digital telephone according to
a preferred embodiment of the present invention is shown
generally at 10. The telephone 10 is of the type manufactured
by Matsushita Electric Industrial Company, Ltd. and
includes a handset 12 with a speaker 14 and a mouthpiece
microphone 16. The telephone also includes a keypad 18 for
entering alphanumeric data into the phone, as is well known
in the telephonic art. A two way transceiver 20 located below
the key pad allows hands free two way communication
between a telephone user (not shown) and the telephone, as
is also well known in the telephonic art.

The telephone 10 also includes a liquid crystal display
(LCD) 24 that displays commands entered through a plurality
of buttons or keys 26. The size of the display will
depend upon the styling and functionality desired. The
presently preferred embodiment uses a two line LCD, shown
in greater detail in FIG. 2a. The LCD shown at 24 in FIG.
2a is a two line LCD capable of displaying a total of 16
characters on each line. An alternate seven line, 16 characters
per line LCD is shown at 24 in FIG. 2b. The LCD
provides contextual meaning for keys 26, shown at 28. The
LCD 24 is also integrated with telephone voice recognition
and processing circuitry to display telephone command
prompts in response to keyed-in or voice commands, as will
be described in detail below.

FIG. 3 is an overall system block diagram of
components of the telephone 10 shown generally at 40.
The telephone 10 communicates with a private branch
exchange (PBX) 42, which in turn is connected to a public
switched telephone network. However, the telephone 10
may be connected to the public switched telephone network
directly or through well-known means other than the PBX
42.

Still referring to FIG. 3, the telephone also has a phone
processor 46 that handles basic phone operation such as
handling keypad input and writing to the display 24. The
speech module 52 is connected to the phone processor 46 to
add voice command capability to the telephone that functions
in parallel with the LCD 24 in accordance with the
present invention. The speech module includes a speech
processor 53 that handles speech recognition and synthesis and
operates the dialog manager. The speech processor 53
accesses database 44 to retrieve stored data used in interpreting
commands. Phone processor 46 is connected
to speech processor 53.

The speech module 52 also includes a speech recognizer
56, a speech synthesizer 58, and a dialog manager 54. The
speech module can be implemented as a separate card that
connects to the phone processor 46. The speech recognizer
56 is responsive to voice commands entered through the
voice data entry device in accordance with the speech
recognition logic described below. The speech synthesizer
58 provides audible prompts to the user through the speaker
14 in response to commands from the processor and
the dialog manager 54.

Referring to FIG. 4, the database 44 is preferably
constructed using a combination of read-only memory for
static prompts and read/write nonvolatile memory for
dynamic prompts. More specifically, the read-only memory
stores the speaker-independent commands. These are key
words that cause the system to perform various system
functions identified in Table 1 below. The user may retrain
these speaker-independent commands, replacing them with
speaker-dependent commands that are then stored in the
read/write memory. When a speaker retrains a command, the
speaker-dependent command overrides the speaker-independent
one. Speaker-dependent commands are entered
through the microphone 16 of the telephone handset. The
read-only memory also stores the phone models that are
[...]. The read-only memory
also stores static prompts. These are prompts that are supplied
to the user via display 24. Dynamic prompts,
representing prompts that can be altered by the user, are
stored in read/write memory. Also stored in read/write
memory are the speaker-dependent names and associated
telephone numbers [...].
Speaker-dependent names are entered using microphone
16; the associated telephone numbers are entered
using keypad 18.

The database preferably has enough memory to store at
least 100 names and telephone numbers, along with the other
information illustrated in FIG. 4.
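
The override behavior described for database 44 can be pictured with a minimal Python sketch. Everything named here (PhoneDatabase, MAX_NAMES, the placeholder model strings) is hypothetical and chosen only for illustration; the patent does not specify a storage layout, and the 100-entry limit simply mirrors the "at least 100 names" figure.

```python
from dataclasses import dataclass, field

MAX_NAMES = 100  # assumed capacity; the text only says "at least 100 names"


@dataclass
class PhoneDatabase:
    """Toy model of database 44: read-only defaults plus user-writable data."""

    # Read-only part: speaker-independent keyword models (placeholders here).
    rom_commands: dict = field(default_factory=dict)
    # Read/write part: retrained keyword models and the name directory.
    user_commands: dict = field(default_factory=dict)
    directory: dict = field(default_factory=dict)

    def retrain(self, keyword, model):
        """Store a speaker-dependent model for an existing keyword."""
        if keyword not in self.rom_commands:
            raise KeyError(f"unknown keyword: {keyword}")
        self.user_commands[keyword] = model

    def command_model(self, keyword):
        """A speaker-dependent model, if present, overrides the default."""
        return self.user_commands.get(keyword, self.rom_commands[keyword])

    def add_name(self, name, number):
        """Add or update a directory entry, respecting the assumed capacity."""
        if name not in self.directory and len(self.directory) >= MAX_NAMES:
            raise OverflowError("directory full")
        self.directory[name] = number
```

In this sketch a retrained keyword never modifies the read-only table; the lookup simply prefers the read/write entry, which is one straightforward way to obtain the override the text describes.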
TABLE 1

KEYWORDS

System      Add
Cancel      Delete
Call        Verify
Lookup      Reset
List        Restore
Program     Complete
Edit        All names
Yes         Adapt
No          Go back
Next one
Restart

A schematic pin out diagram showing the interconnection 
of the processor 46 with the speech module 52 is shown in 
FIG. 5. Signal functions of the processor 46 and the speech 
module 52 are given below in Table 2. 



TABLE 2

SIGNAL  IN  OUT  FUNCTION                           INT  ACT  NOTE
ALUN    X   X    Speech card unit sign                   L    L: Installed
ALB0    X   X    Data D0
ALB1    X   X    Data D1
ALB2    X   X    Data D2
ALB3    X   X    Data D3
ASTR        X    Interface control signal           H    L
AACK    X        Speech card ACK signal             H    L
AARQ    X        Speech card access request signal  H    L    L: On access



The digital voice telephone of the present invention may
be operated through use of the keys 26, through voice
commands processed by the speech module 52, or through
a combination of both the keys and voice commands.
Therefore, if for some reason the speech module 52 is
disabled, the telephone 10 may function as a conventional
digital telephone without voice command capability.

Refer now to FIG. 6. FIG. 6 illustrates the major func- 
tional components of the multimodal telephone of the inven- 
tion. The phone processor or APU 46 supports the display 24 
and also the keypad 18. The speech module 52 comprises the 
dialog manager 54, including the speech recognizer 56 and 
speech synthesizer 58. If the speech module 52 is not 
connected to the APU 46, the APU 46 will nevertheless 
function as a standard touchtone telephone. The APU 
includes its own processor and associated memory that 
define a state machine 90. Specifically, state machine 90
describes the various telephone operating states that the user 
may place the telephone system in. These states include, for 
example, placing a call on hold, forwarding an incoming call 
to another number, transferring a call, and so forth. These 
states are typically those provided by conventional digital 
telephones for use with PBX systems. The keypad 18 serves 
as the user input to APU 46 and the display 24 serves as the 
user output. 

The telephone of the present invention differs signifi- 
cantly from conventional digital telephones by virtue of the 
dialog manager 54 and its associated speech recognizer and 
speech synthesizer modules. The dialog manager is coupled 
to the APU to support bidirectional communication with the 
APU. The speech recognizer 56 serves as the user input and 
the speech synthesizer 58 serves as the user output. The 
dialog manager defines its own state machine 92. This state 
machine maintains the dialog context. That is, the dialog 
manager, through its state machine 92, maintains a record of
the current interaction between the user and the telephone,
including how the user arrived at that point in the dialog,
where applicable. For example, if the user has entered the
command "call" followed by the name "Carl," the state
machine 92 stores the fact that the user is attempting to place
a call, as opposed to storing a telephone number for the party
"Carl." The dialog context is used by the speech recognizer
to help determine which is the most likely candidate for
selection as the recognized word. Thus, in the preceding
example, the speech recognizer would not confuse the word
"Carl" for the word "call" because the word "Carl" followed
the word "call," signifying that the word "Carl" is not a
command but a name. The dialog context is also used to
identify which commands are allowed at any given level in
the dialog. By virtue of the bidirectional connection between
the dialog manager 54 and the APU 46, the allowed commands
at any stage in the dialog are also furnished to the
display 24. This gives the user a visual indication of what
commands are possible at this point in the dialog.
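
As an illustration of how a dialog context can both drive the display and constrain the recognizer, the following Python sketch may help. It is not taken from the patent's Appendix: the state names, the DialogState and DialogContext classes, and the (word, score) candidate format are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    name: str
    expects: str  # "keyword" or "name": what kind of word is valid here
    commands: dict = field(default_factory=dict)  # keyword -> next state


# Hypothetical fragment of a state hierarchy (not the one in FIGS. 8 and 9).
STATES = {
    "ready": DialogState("ready", "keyword", {"call": "call", "add": "add"}),
    "call": DialogState("call", "name"),
    "add": DialogState("add", "name"),
}


class DialogContext:
    def __init__(self, names):
        self.active = "ready"   # pointer to the currently active state
        self.history = []       # how the user arrived at this point
        self.names = set(names)

    def allowed_words(self):
        state = STATES[self.active]
        if state.expects == "keyword":
            return set(state.commands)
        return self.names

    def prompts(self):
        """Commands to show on the display for the active state."""
        return sorted(STATES[self.active].commands)

    def pick(self, candidates):
        """Return the best-scoring candidate that fits the current context."""
        allowed = self.allowed_words()
        viable = [(w, s) for w, s in candidates if w in allowed]
        if not viable:
            return None
        return max(viable, key=lambda ws: ws[1])[0]

    def accept(self, word):
        """Record the word and advance the active state if it is a keyword."""
        state = STATES[self.active]
        self.history.append((self.active, word))
        if state.expects == "keyword":
            self.active = state.commands[word]
```

With a directory containing "carl", the same pair of acoustically similar candidates resolves to "call" in the ready state and to "carl" once the call state is active, which is the disambiguation the paragraph describes.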

The connection between dialog manager 54 and APU 46
ensures that these two processors operate in synchronism.
Thus, if a user selects a soft key 26 associated with a given
prompt on the display 24, that selection is sent to the dialog
manager 54, where the information is used to cycle state
machine 92 to the proper dialog context. Alternatively, if the
user enters a verbal command that is recognized by speech
recognizer 56, the dialog manager sends the command to APU
46, where it is carried out just as if the user had entered it
through the soft key 26 or keypad 18. The dialog manager
is capable of sophisticated processing of a user's input
before transmitting control commands to the APU. For
example, the dialog manager upon receipt of a command
"call Carl" would look the name "Carl" up in database 44
and obtain the telephone number stored for that party. The
dialog manager would then send commands to APU 46 that
are interpreted by APU 46 as numeric digits entered via
keypad 18. In this way, the telephone performs a voice
dialing function.
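
The voice dialing path just described (name lookup, then digits delivered to the APU as if keyed) can be sketched in a few lines of Python. FakeAPU and voice_dial are hypothetical stand-ins; the actual command format exchanged between the speech card and the APU is not given in this section.

```python
class FakeAPU:
    """Stand-in for the phone processor; records the key events it receives."""

    def __init__(self):
        self.keys = []

    def press(self, key):
        self.keys.append(key)


def voice_dial(name, directory, apu):
    """Look up a recognized name and replay its number as keypad presses.

    Returns False when the name has no stored number, so the caller can
    prompt the user instead of dialing.
    """
    number = directory.get(name)
    if number is None:
        return False
    for ch in number:
        # Skip formatting characters such as dashes or spaces.
        if ch.isdigit() or ch in "*#":
            apu.press(ch)
    return True
```

Because the APU sees only ordinary key events in this sketch, its own state machine needs no knowledge of speech at all, which matches the modular split between the phone processor and the speech card.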

FIG. 7 shows in greater detail how the state machine 90
and state machine 92 integrate with one another. In FIG. 7
the states of state machine 90 are depicted using circles and
the top level states of state machine 92 are depicted using
rectangles. For example, when the user first lifts the handset
of a telephone to use it, the state machine of APU 46 (state
machine 90) is in the ready call state 200. The user will hear
a dial tone through the speaker of the handset. From this
state the user may use the keypad buttons 18 to dial a number
and enter the conversation state 202. Alternatively, from the
ready call state 200 the user may activate the redial button on
the telephone to enter redial state 204. In this state the APU
automatically dials the last dialed number, whereupon the
conversation state 202 is entered. Similarly, the user can
press a speed dial button that has been previously programmed
with a frequently used phone number. This causes
the state machine 90 to enter state 206. In this state the APU
dials the stored number and then enters the conversation
state 202. While in the conversation state the user may press
the hold button, causing the state machine to index to the
hold state 208. While in the conversation state the user may
also transfer a call by pressing the transfer button on the
telephone, causing state machine 90 to index to the transfer
state 210. Similarly, while in the conversation state, the user
can press the conference call button, causing the state
machine to index to the conference call state 212. The
transfer and conference call buttons place the call on hold
while allowing the user to establish contact with another
party.
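
The transitions of state machine 90 described above lend themselves to a simple transition table. The Python sketch below is illustrative only: the state and event names are invented labels for the numbered states in FIG. 7, and the "dialed" event (marking that automatic dialing has finished) is an assumption added to model the move into the conversation state.

```python
# (current state, event) -> next state, following the description of FIG. 7.
TRANSITIONS = {
    ("ready_call", "dial"): "conversation",           # 200 -> 202
    ("ready_call", "redial"): "redial",               # 200 -> 204
    ("ready_call", "speed_dial"): "speed_dial",       # 200 -> 206
    ("redial", "dialed"): "conversation",             # 204 -> 202
    ("speed_dial", "dialed"): "conversation",         # 206 -> 202
    ("conversation", "hold"): "hold",                 # 202 -> 208
    ("conversation", "transfer"): "transfer",         # 202 -> 210
    ("conversation", "conference"): "conference_call",  # 202 -> 212
}


def step(state, event):
    """Return the next state; events with no listed transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="ready_call"):
    """Apply a sequence of events starting from the ready call state."""
    for event in events:
        state = step(state, event)
    return state
```

Keeping the transitions in a table rather than in branching code makes it easy to see which button presses are meaningful in each state.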
The presently preferred telephone unit includes as one of
its function key buttons, a voice key button that, when
activated from certain states, will invoke the services of the
dialog manager and its associated state machine 92. In FIG.
7 the voice key state 214 may be entered from some (but not
all) of the states of state machine 90. As illustrated, the voice
key state 214 may be entered from the ready call state 200,
from the hold state 208, from the transfer state 210 and from
the conference call state 212. Entering this state, in effect,
activates the dialog manager. The dialog manager begins in
the ready call state 220, which is the primary access point for
the remaining states of state machine 92 illustrated at
222-236. Each of the states of state machine 92 is
described in detail in connection with FIGS. 8 and 9.

From a functional standpoint, the ready call state 200 of
state machine 90 and the ready call state 220 of state
machine 92 coincide. Stated differently, when the voice key
state is entered, the functional states 222-236 of state
machine 92 are, in effect, added to the functionality of the
telephone unit as defined by state machine 90. Thus, for
example, from the call state 222, the dialog manager will
obtain the name to look up by performing speech
recognition, look up the name in the database and then dial
the number by sending the appropriate dialing commands to
the APU. Having done this, the system would then be in the
conversation state 202, just as if the user had manually
dialed the number from the ready call state 200. Although
some of the functional states 222-236 of state machine 92
will cause state changes to occur in state machine 90 (as the
voice dialing function does), not all of them do.
However, state machine 92 serves the additional function of
maintaining a record of the current dialog context; that is,
the context in which the user's input is to be interpreted. The
dialog manager maintains a data structure that defines the
possible states of state machine 92 as well as how those
states are hierarchically related. This data structure thus
serves to define what commands are possible from any given
state within the state machine. The dialog manager maintains
a pointer to the currently active state (that is, the state
that the user most recently selected). Knowing the currently
active state, the dialog manager consults the data structure to
determine what operations can be
performed from the active state and what prompts are
appropriate for the active state. The dialog manager communicates
the dialog context to the phone processor that in
turn displays what commands are possible upon the liquid
crystal display. In this way, the user will always know what
commands are possible by looking at the LCD display.

The presently preferred implementation will automatically
revert from the ready call state 220 to the ready call
state 200 after a predetermined time has elapsed without any
action being taken. This is illustrated diagrammatically by
the timer 240 in FIG. 7. The timeout duration will depend on
the particular dialog context. For example, the system will
wait for a longer time (e.g. 2 minutes) in the top level states,
such as the ready call state 220. The system will wait a
shorter time (e.g. 2 seconds) when the system is in a lower
state that provides a default action to automatically take
place if the user does not respond.

The state machine 92 of the presently preferred embodiment
is illustrated in FIGS. 8 and 9. As indicated above, state
machine 92 is implemented by the dialog manager 54.
Essentially, dialog manager 54 augments the states available
through the APU 46 (state machine 90) with additional states
(state machine 92). By virtue of the bidirectional link
between the dialog manager and the APU, these two state
machines work in full synchronism with one another.
However, because state machine 92 adds functionality to the
telephone system that is not found in the APU-driven system
alone, state machines 90 and 92 do not entirely overlap one
another.

Referring to FIGS. 8 and 9, the ready call state 220 serves
as the starting point from which the basic top level functional
states 222-236 can be selected. See pseudocode in
Appendix for an example of how this top level state may be
programmed. Each of these functional states leads to a
plurality of additional states that the user will enter and exit
while conducting a dialog with the dialog manager. The
timeout timer 240 (FIG. 7) is set at every state in the dialog
unless otherwise specified. In the state diagrams of FIGS. 8
and 9, the designation "K" stands for "keyword." In the
preferred embodiment, the commands displayed on the LCD
are in decreasing likelihood order. The preferred
embodiment uses soft keys to effect scroll up and scroll
down functions, allowing the user to view more options than
can be displayed at any one time on the liquid crystal display
screen. By this display technique, the system can be
easily upgraded to add additional commands or functions,
simply by adding those additional keywords to the displayed
list. This approach avoids the necessity of reprogramming
the entire state machine system when new functions are
added.

The present invention employs a unique compact speech
representation based on regions of high phoneme similarity
values. As shown in FIG. 10, there is an overall consistency
in the shape of the phoneme similarity time series for a given
word. In FIG. 10 phoneme similarity time series for the word
"hill" spoken by two speakers are compared. Although the
precise wave shapes differ between the two speakers, the
phoneme similarity data nevertheless exhibit regions of
similarity between the speakers. Similar behavior is
observed in the phoneme plausibility time series that has
been described by Gong and Haton in "Plausibility Functions
in Continuous Speech Recognition: The VINICS
System," Speech Communication, Vol. 13, October 1993,
pp. 187-196.

Conventional speech recognition systems match each
input utterance to reference templates, such as templates
composed on phoneme similarity vectors, as in the model
speech method (MSM) of Hoshimi et al. In these conventional
systems the reference speech representation is frame-based
and requires a high data rate, typically 8 to 12
parameters every 10 to 20 milliseconds. The frame-by-frame
alignment that is required with these conventional systems is
computationally costly and makes this approach unsuitable
for larger vocabularies, especially when using small hardware.

The present system uses a multistage word recognizer that
is applied prior to a frame-by-frame alignment, in order to
reduce the search space and to achieve real time performance
improvements. The number of stages in the
recognizer, as well as the computational complexity of each
stage and the number of word candidates preserved at each
stage, can be adjusted to achieve desired goals of speed,
memory size and recognition accuracy for a particular
application. The word recognizer uses an initial representation
of speech as a sequence of multiple phoneme similarity
values. However, the word recognizer further refines this
speech representation to preserve only the interesting
regions of high phoneme similarity. Referring to FIG. 11, the
interesting regions of high phoneme similarity value are
represented as high similarity regions. By representing the
speech as features at a lower data rate in the initial stages of
recognition, the complexity of the matching procedure is
greatly reduced.
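
The idea of a multistage recognizer that narrows the candidate list before any expensive alignment can be sketched generically in Python. This is not the patent's recognizer: the function name, the (score_fn, keep) stage format, and the choice to combine stage scores by simple addition are all assumptions made for illustration.

```python
def multistage_recognize(utterance, vocabulary, stages):
    """Prune word candidates stage by stage and return the best survivor.

    stages is a list of (score_fn, keep) pairs, ordered from cheapest to
    most expensive. score_fn(utterance, word) returns a number where higher
    means a better match; keep is how many candidates survive the stage.
    Scores are accumulated by addition (an assumed combination rule).
    """
    candidates = {word: 0.0 for word in vocabulary}
    for score_fn, keep in stages:
        for word in candidates:
            candidates[word] += score_fn(utterance, word)
        ranked = sorted(candidates, key=candidates.get, reverse=True)[:keep]
        candidates = {word: candidates[word] for word in ranked}
    return max(candidates, key=candidates.get)
```

The number of stages and each stage's keep value are the tuning knobs: a small keep in an early, cheap stage saves computation later but risks discarding the correct word, which is the speed versus accuracy trade-off the text mentions.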
The multistage word recognizer also employs a unique
scoring procedure for propagating and combining the scores
obtained at each stage of the word recognizer in order to
produce a final word decision. By combining the quasi-independent
sources of information produced at each stage,
a significant gain in accuracy is obtained.

The system's architecture features three distinct components
that are applied in sequence on the incoming speech to
compute the best word candidate.

Referring to FIG. 12, an overview of the presently preferred
system will be presented. The first component of the
present system is a phoneme similarity front end 110 that
converts speech signals into phoneme similarity time series.
Speech is digitized at 8 kilohertz and processed by 10th
order linear predictive coding (LPC) analysis to produce 10
cepstral coefficients every 100th of a second. Each block of
10 successive frames of cepstral coefficients is compared to
55 phoneme reference templates (a subset of the TIMIT
phoneme units) to compute a vector of multiple phoneme
similarity values. The block of analysis frames is then
shifted by one frame at a time to produce a vector of
phoneme similarity values each centisecond (each 100th of
a second). As illustrated in FIG. 12, the phoneme similarity
front end works in conjunction with a phone model database
112 that supplies the phoneme reference templates. The
output of the phoneme similarity front end may be stored in
a suitable memory for conveying the set of phoneme similarity
time series so generated to the word recognizer stages.

The word recognizer stages, depicted in FIG. 12 generally
at 114, comprise the second major component of the system.
A peak driven procedure is first applied on the phoneme
similarity time series supplied by front end 110. The peak
driven procedure extracts High Similarity Regions (HS
Regions). In this process, low peaks and local peaks of
phoneme similarity values are discarded, as illustrated in
FIG. 11. In the preferred embodiment regions are characterized
by 4 parameters: phoneme symbol, height at the peak [...]

[...] HS regions over a predefined number of time intervals. The
presently preferred embodiment divides words into three
equal time intervals in which each phoneme interval is
described by (1) the mean of the number of HS regions
occurring in that interval and (2) a weight that is inversely
proportional to the square of the variance, which indicates
how reliable the region count is. Specifically, for a score
normalized between 0 and 100, the weight would be
100/(variance^2 + 2). These parameters are easily estimated from
training data. In the currently preferred implementation,
each word requires exactly 330 parameters, which correspond
to two statistics, each over three intervals, each
comprising 55 phoneme units (2 statistics x 3 intervals x 55
phoneme units).

[...] modeling was found to be very effective due
[...] fast alignment time (0.33 milliseconds per test word on
a [...] workstation) and its high top 10% accuracy.

The [...] prototype is constructed as follows. A
first utterance of a training word or phrase is represented as
time-dependent phoneme similarity data. In the presently
preferred embodiment each utterance is divided into N time
intervals. Presently each utterance is divided into three time
intervals, with each time interval being represented by data
corresponding to the 55 phonemes. Thus the presently
preferred implementation represents each utterance as a
3x55 vector. In representing the utterance as a 3x55 vector,
each element in a given interval stores the number of
high similarity regions detected for each given phoneme.
Thus if three occurrences of the phoneme "ah" occur in the
[...] interval, the number 3 is stored in the vector element
corresponding to the "ah" phoneme.

[...] performed for
each of the successive utterances of the training word or
phrase. Specifically, each successive utterance is represented
as a vector like that of the first utterance. The two vectors are
[...]
maintained to keep track of the current number of utterances

location and tune locations of the left and right frames. Over ^^^^^ ^^^^ combined 

our data corpus, an average of 60 regions per second of ^ inductively or iteratively in this 

speech is observed In HO. 12 the high similarity region ^^^^ ^^^^ 

new utterance being combined with the pre- 

extraction module U6 performs the peak dnven procedure. ^^^^^ ^^.^ g^^jj j^at the sum and sum of squares vectors 

TTie output of the HS region extraction module is supplied ^,ti„„eiy esent the accumulated data from all of the 

to two difierent word recognizer stages that operate using utterances 

different recognizer techniques to provide a short list of utterances have been processed in this 

word candidates tor the tine match final recognizer stage p u- ♦u ♦ ^ . • i i * j 

^ ^ fashion the vector mean and vector variance are calculated. 

The mean vector is calculated as the sum vector divided by 

The first of the two stages of word recognizer 114 is the the number of utterances used in the traimng set. The vector 

Region Count stage or RC stage 118. This stage extracts a variance is the mean of the squares minus the square of the 

short list of word candidates that are then supplied to the 50 means. The mean and variance vectors are then stored as the 

next stage of the word recognizer 114, the Target Congru- region count prototype for the given word or phrase. The 

ence stage or TC stage 120. The RC stage 118 has an RC same procedure is followed to similarly produce a mean and 

word prototype database 122 that suppUes compact word variance vector for each of the remaining words or phrases 

representations based on the novel compact speech repre- in the lexicon. 

sentation (regions of high phoneme similarity values) of the 55 when a test utterance is compared with the RC prototype, 
invention. Similarly, the TC stage 120 also includes a TC the test utterance is converted into the time dependent 
word prototype database 124 that supplies a different com- phoneme similarity vector, essentially in the same way as 
pact word representation, also based on the compact speech each of the training utterances were converted. The Euclid- 
representation of the invention. The TC stage provides a ean distance between the test utterance and the prototype is 
more selective short list of word candidates, essentially a computed by subtracting the test utterance RC data vector 
further refinement of the list produced by the RC stage 118. from the prototype mean vector and this difference is then 

The word decision stage 126, the final major component squared. The Euchdean distance is then multiplied by a 

of the present system selects the word with the largest score weighting factor, preferably the reciprocal of the prototype 

from the short list supphed by TC stage 120. variance. The weighted Euclidean distance, so calculated, is 

Region Count Modeling 65 then converted into a scalar number by adding each of the 

The RC stage 118 of word recognizer 114 represents each vector component elements. In a similar fashion the weight- 
reference word with statistical information on the number of ing factor (reciprocal of the variance) is converted into a 
scalar number by adding all of the vector elements. The final score is then computed by dividing the scalar distance by the scalar weight.

The above process may be repeated for each word in the prototype lexicon and the most probable word candidates are then selected based on the scalar score.

Target Congruence Modeling

The second stage of the word recognizer represents each reference word by (1) a prototype which consists of a series of phoneme targets and (2) by global statistics, namely the average word duration and the average "match rate," which represents the degree of fit of the word prototype to its training data. In the presently preferred embodiment targets are generalized HS regions described by 5 parameters:

1. phoneme symbol;

2. target weight (percentage occurrence in training data);

3. average peak height (phoneme similarity value);

4. average left frame location;

5. average right frame location.

Word prototypes are automatically created from the training data as follows. First, HS regions are extracted from the phoneme similarity time series for a number of training speakers. The training data may be generated based on speech from a plurality of different speakers or it may be based on multiple utterances of the same training words by a single speaker. Then, for each training utterance of a word, reliable HS regions are computed by aligning the given training utterance with all other utterances of the same word in the training data. This achieves region-to-region alignment.

For each training utterance the number of occurrences (or probability) of a particular region is then obtained. At that time, regions with probabilities less than a pre-established Reliability Threshold (typically 0.25) are found unreliable and are eliminated. The word prototype is constructed by merging reliably detected, high similarity regions to form targets. At the end of that process a target rate constraint (i.e. desired number of targets per second) is then applied to obtain a uniform word description level for all the words in the lexicon. The desired number of targets per second can be selected to meet system design constraints such as the ability of a given processor to handle data at a given rate. By controlling the target rate a reduction in the number of targets is achieved by keeping only the most reliable targets. Once the word prototype has been obtained in this fashion, the average match rate and average word duration are computed and stored as part of the word prototype data.

The number of parameters needed to represent a word depends on the average duration of the word and on the level of phonetic detail that is desired. For a typical 500 millisecond word at 50 targets per second, the speech representation used by the presently preferred embodiment employs 127 parameters, which correspond to 5 values per target × 50 targets per second × 0.5 seconds + 2 global statistics (average match rate and average word duration).

FIG. 13 illustrates the word prototype training procedure by which the TC word prototype database 124 is constructed. The RC word prototype database 122 is constructed by a similar, but far simpler process, in that only the presence or absence of an HS region occurring with each of the three equal time intervals must be detected.

Referring to FIG. 13, the HS Region Computation Module 116 is used to convert the similarity time series from the speech database into a list of HS regions. The alignment module 130 operates on this list of HS regions to eliminate unreliable regions by alignment across speakers. Again, the process can be performed across a plurality of different speakers or across a plurality of utterances by the same speaker.

Next the list of reliable regions, together with the associated probabilities of detecting those regions, is passed to the target building module 132. This module builds targets by unifying the region series to produce a list of phoneme targets associated with each word in the database. This list of phoneme targets is then supplied to a module 134 that adjusts the target rate by applying the target rate constraint. The target rate constraint (the desired number of targets per second) may be set to a level that achieves the desired target rate. After adjusting the target rate a statistical analyzer module 136 estimates the global statistics (the average match rate and the average word duration) and these statistics along with the list of targets at the selected rate are then stored as the TC word prototype database 124.

Recognition

Given an active lexicon of N words, the region count stage is first applied to produce a short list of word candidates with normalized scores. A weighted Euclidean distance is used to measure the degree of fit of a test word X to a reference word P (in RC format as supplied by the RC word prototype database). Specifically, in the current implementation the weighted Euclidean distance is defined as

$$E(X,P) = \frac{\sum_{i=1}^{3}\sum_{j=1}^{55} w_{ij}\,(x_{ij}-p_{ij})^{2}}{\sum_{i=1}^{3}\sum_{j=1}^{55} w_{ij}}$$

where $x_{ij}$ is the number of HS regions in time interval i for phoneme j, where $p_{ij}$ is the corresponding average number of HS regions estimated on training data, and where $w_{ij}$ is the corresponding weight. The N/10 highest scoring word prototypes are preserved as word candidates and their scores (weighted Euclidean distances) are normalized by dividing each individual score by the highest score. This defines a normalized score $S_{RC}$ for each word. Normalized scores range from 0 to 1 and are dimensionless, making it possible to combine scores resulting from different scoring methods.

The target congruence stage is then applied on each word candidate selected by the RC stage. A region-to-target alignment procedure is used to produce a congruence score between the test word and a given word reference (in TC format as supplied by the word prototype database 124). The congruence score of a matched target $CG_{match}$, that is, the alignment found between target t of the prototype and region r of the test word, is defined as

$$CG_{match}(t,r) = \min\left(\frac{A_t}{A_r}, \frac{A_r}{A_t}\right)$$

where $A_t$ and $A_r$ respectively represent the target's area and the aligned region's area in the time similarity plane.

The congruence score of an unmatched target $CG_{unmatch}$ is computed in the same way, using an estimate for the area $A_r$ of the missing HS region. The estimated area $A_r$ is computed as the area under the similarity curve for the target's phoneme label between the projected locations of the target's left and right frames.

The word congruence score is computed as the weighted sum of congruence scores for all the targets, divided by the sum of their weights. Normalized congruence scores $S_{TC}$ are computed by dividing the individual congruence scores by the highest congruence score. The final score output by the word recognizer is a combination of the information obtained at each recognizer stage. In the presently preferred embodiment the final score output of the recognizer is:

$$S_{output} = \frac{S_{RC} + S_{TC}}{2}$$
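The region count scoring described in the Recognition section (a weighted squared difference summed over the 3×55 count vector, divided by the summed weights, followed by normalization against the highest score) can be sketched in C as follows. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the type RcPrototype and the functions rc_distance and normalize_scores are hypothetical names introduced here, and the weights are taken as given inputs rather than derived from the variance.

```c
#include <stddef.h>

#define RC_INTERVALS 3                      /* time intervals per word (from the text) */
#define RC_PHONEMES  55                     /* phoneme units per interval (from the text) */
#define RC_DIM (RC_INTERVALS * RC_PHONEMES) /* 165 counts per utterance */

/* Hypothetical container for one region count prototype. */
typedef struct {
    double mean[RC_DIM];   /* p_ij: average HS region count from training */
    double weight[RC_DIM]; /* w_ij: reliability weight for each count */
} RcPrototype;

/* Weighted Euclidean distance between a test count vector x and a
 * prototype: sum of w*(x-p)^2 divided by the sum of w.
 * Returns 0.0 when the total weight is not positive (assumed convention). */
double rc_distance(const double x[RC_DIM], const RcPrototype *proto)
{
    double num = 0.0;
    double den = 0.0;
    for (size_t k = 0; k < RC_DIM; ++k) {
        double d = x[k] - proto->mean[k];
        num += proto->weight[k] * d * d;
        den += proto->weight[k];
    }
    return den > 0.0 ? num / den : 0.0;
}

/* Divide every score by the largest score so that values fall in [0, 1].
 * Scores are left unchanged if the largest score is not positive. */
void normalize_scores(double *scores, size_t n)
{
    double max = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (scores[i] > max) {
            max = scores[i];
        }
    }
    if (max <= 0.0) {
        return;
    }
    for (size_t i = 0; i < n; ++i) {
        scores[i] /= max;
    }
}
```

A caller would compute rc_distance once per lexicon entry, keep the short list of candidates, and pass their scores to normalize_scores before combining them with the target congruence scores. Storing the 3×55 data as one flat array of 165 values is a layout choice made here for simplicity.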
APPENDIX



int VTAutomaton
(
    int State
)
{
    /* Per-call working variables */
    CandidateList *List;
    SpeechSignal *SpcSgl;
    SimilaritySignal *SimSgl;
    int WordIndex;
    int SimilarityScore;
    int SimilarityScore1;
    int SimilarityScore2;
    Boolean Ok;
    int NbrPhrases;
    DialogueContext DlgCtx;
    int NextState;
    unsigned char Mask;
    UtteranceTrainingList *TrainingList;
    int ID;
    int Ix;
    int NextNameIndex;
    EmulationEvent Event;
    char Buffer[DEFSTR+1];
    char Format[DEFSTR+1];
    int Offset;

    /* State retained between calls */
    static WordModel *Model=NULL;
    static CandidateList *NameList=NULL;
    static SpeechSignal *Rep=NULL;
    static SimilaritySignal *SimRep1=NULL;
    static SimilaritySignal *SimRep2=NULL;
    static SimilaritySignal *SimRep3=NULL;
    static char Digit;
    static unsigned char KeyCode;
    static int Direction;
    static char GivenName[DEFSTR+1];
    static char GivenNumber[DEFSTR+1];
    static int Index;
    static int NameIndex;
    static String Name;
    static char Number[DEFSTR+1];
    static int FirstNameIndex;
    static int LastNameIndex;
    static int IdleCount;
    static int NumberIndex;

    SpcSgl=MallocSpeechSignal();
    SimSgl=MallocSimilaritySignal();
    List=MallocCandidateList();
    NameList=(*Scrpt).NameList;
    switch (State)
    {
    case VTS_BOOT:
    {
        NextState=VTS_BOOT_INITIATE;
    } break;

    case VTS_BOOT_INITIATE:
    {
        printf("WAITING FOR SETUP\n");
        InitDialogueContext(&DlgCtx,NODEF);
        DlgCtx.EmuCtx.SetupState=VTS_BOOT_CONFIGURE;
        GetInput(&DlgCtx,State,&NextState,List,SpcSgl,SimSgl,&Event);
    } break;

    case VTS_BOOT_CONFIGURE:
    {
        UpdateEmulationFunctionBuffer(FUNCTION_OVERRIDE_OFF);
        UpdateEmulationFunctionBuffer(FUNCTION_KEYAS_COMMAND);
        UpdateEmulationFunctionBuffer(FUNCTION_ACTIVATION_HANDSET);
        SendEmulationFunctionBuffer();
        NextState=VTS_WAIT;
    } break;

    case VTS_INTERRUPT:
    {
        ClearLCD();
        UpdateEmulationLEDDisplayBuffer(LED_START_LED_OFF);
        SendEmulationLEDDisplayBuffer();
        NextState=VTS_WAIT;
    } break;

    case VTS_WAIT:
    {
        Mask=GetHandsetStatus();
        if(Mask==STATUS_HANDSET_OFF)
        {
            NextState=VTS_WAIT_HANDSET;
        }
        else
        {
            NextState=VTS_WAIT_START;
        }
    } break;

    case VTS_WAIT_HANDSET:
    {
        printf("WAITING FOR HANDSET\n");
        InitDialogueContext(&DlgCtx,NODEF);
        DlgCtx.EmuCtx.HandSetOnStatusState=VTS_WAIT_START;
        GetInput(&DlgCtx,State,&NextState,List,SpcSgl,SimSgl,&Event);
    } break;

    case VTS_WAIT_START:
    {
        printf("WAITING FOR ALIBABA MODE\n");
        InitDialogueContext(&DlgCtx,NODEF);
        DlgCtx.EmuCtx.ModeOnStatusState=VTS_WAIT_ENGAGE;
        DlgCtx.EmuCtx.HandSetOffStatusState=VTS_WAIT_HANDSET;
        GetInput(&DlgCtx,State,&NextState,List,SpcSgl,SimSgl,&Event);
    } break;

    case VTS_WAIT_ENGAGE:
    {
        UpdateEmulationLEDDisplayBuffer(LED_START_LED_ON);
        SendEmulationLEDDisplayBuffer();
        ClearLCD();
        DisplayLCD(LCD_LINE_1,"WELCOME...");
        PlayMessage("Welcome");
        NextState=VTS_CALL;
    } break;

    case VTS_TOP:



What is claimed is: 

1. A multimodal telephone comprising:

a telephone unit having a microphone and speaker for 
supporting voice communication by a user; 

a visual display device disposed on said telephone unit, 
the display adapted for displaying a plurality of different
command prompts to the user;

at least one programmable function key for enabling entry 
of keyed commands, said function key disposed on said 
telephone unit adjacent said visual display such that at 
least a portion of said command prompts are displayed 
approximately adjacent said function key; 

a speech module disposed within said telephone unit, the 
speech module including a speech recognizer and a 
speech generator, the speech module being coupled to 
said telephone unit so that said speech recognizer is 
responsive to voiced commands entered through said 
microphone and said speech synthesizer provides 
audible prompts through said speaker; 

a dialog manager coupled to said visual display, to said 
function key and to said speech module, said dialog
manager defining a set of linked control function states 
each state associated with a respective one of said 
command prompts and at least a portion of said set of 
linked control function states being further associated
with a respective one of said audible prompts; 

said dialog manager being responsive to said voiced 
commands and to said function key to traverse said set 
of linked control function states to select one of said set 
of linked control function states as an active state; 

said dialog manager being operative to maintain synchronism
between said command prompts and said audible
prompts such that the control function states linked to 



said active state are displayed as a command prompts 
and the user has the option to move from said active 
state to one of said control function states linked to said 
active state by either voiced command or keyed com- 
mand; and 

wherein said dialog manager stores a dialog context and 
wherein said speech recognizer selects a plurality of
word candidates in response to voiced commands and 
uses said dialog context to select among said plurality 
of word candidates. 

2. The telephone of claim 1 wherein said set of linked 
control function states define a state machine and wherein 
said dialog manager administers said state machine. 

3. The telephone of claim 1 wherein said speech module 
further includes a voice dialing module having database of 
user-defined names and phone numbers. 

4. The telephone of claim 1 wherein said dialog manager 
stores a dialog context and wherein said dialog manager 
communicates said dialog context to said telephone. 

5. The telephone of claim 1 wherein said dialog manager 
stores a dialog context and wherein said dialog manager 
communicates said dialog context to said telephone unit for 
presentation on said visual display. 

6. The telephone of claim 4 wherein said telephone unit 
includes a phone processor coupled to said visual display 
and wherein said dialog manager communicates said dialog 
context to said phone processor. 

7. The telephone of claim 1 wherein said speech recog- 
nizer represents speech as high phoneme similarity values. 

8. The telephone of claim 1 wherein said speech recog- 
nizer employs a region count stage that extracts a list of 
word candidates based on regions of high phoneme simi- 
larity values. 

9. The telephone of claim 1 wherein said speech recog- 
nizer employs a target congruence stage that extracts a list 
of word candidates based on the alignment of regions of high
phoneme similarity with word prototypes.

10. The telephone of claim 1 wherein said speech recog- 
nizer employs: 

a region count stage that extracts a first list of word
candidates based on number regions of high phoneme 
similarity values, and 

a target congruence stage that extracts a second list of 
word candidates from said first list based on alignment 
of regions of high phoneme similarity values with word 
prototypes.

11. The telephone of claim 1 wherein said dialog manager 
includes a memory for storing a database comprising both 
speaker independent and speaker dependent information. 

12. The telephone of claim 1 wherein said telephone unit 
includes a phone processor module programmed for opera- 
tive communication with a telephone branch exchange sys- 
tem. 
13. A multimodal telephone comprising:

a telephone unit having a microphone for supporting 
voice communication by a user; 

a speech module coupled to said telephone unit, the 
speech module including a speech recognizer that is 
responsive to voiced commands entered through said 
microphone; 

a dialog manager coupled to said speech module that 
stores a dialog context and wherein said speech recog- 
nizer selects a plurality of word candidates in response 
to said voiced commands and uses said dialog context 
to select among said plurality of word candidates. 

***** 